
perf: avoid O(N^2) exiting-branch checks in CodeFolding #8599

Open

Changqing-JING wants to merge 3 commits into WebAssembly:main from Changqing-JING:opt/compile-speed

Conversation

Contributor

Changqing-JING commented Apr 14, 2026

Follow-up PR to #8586, further optimizing CodeFolding.

`optimizeTerminatingTails` calls `EffectAnalyzer` once per tail item, and each call walks the tail's full subtree. On deeply nested blocks this is O(N^2).

Replace the per-item walks with a single O(N) bottom-up `PostWalker` (`populateExitingBranchCache`) that pre-computes the exiting-branch result for every node, making subsequent lookups O(1).

Example: AssemblyScript GC compiles `__visit_members` as a `br_table` dispatch over all types, producing ~N nested blocks with ~N tails. The old code walks each tail's subtree separately, for O(N^2) total node visits. With this change, one bottom-up walk covers all nodes, and each subsequent tail lookup is O(1).

```wat
(block $A          ;; depth 4000
  (block $B        ;; depth 3999
    (block $C      ;; depth 3998
      ...
      (br_table $A $B $C ... (local.get $rtid))
    )
    (unreachable)  ;; tail at depth 3999, old code walks 3999 nodes
  )
  (unreachable)    ;; tail at depth 4000, old code walks 4000 nodes
)
```

Benchmark data: the test module is from issue #7319 (#7319 (comment)).

On main HEAD:

```shell
$ time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm

real    9m16.111s
user    35m33.985s
sys     0m51.000s
```

With this PR:

```shell
$ time ./build/bin/wasm-opt -Oz --enable-bulk-memory --enable-multivalue --enable-reference-types --enable-gc --enable-tail-call --enable-exception-handling -o /dev/null ./test3.wasm

real    5m17.170s
user    30m9.198s
sys     0m28.030s
```

Changqing-JING requested a review from a team as a code owner April 14, 2026 03:38
Changqing-JING requested review from kripken and removed request for a team April 14, 2026 03:38
Changqing-JING marked this pull request as draft April 14, 2026 03:38
Comment thread src/passes/CodeFolding.cpp Outdated
Comment thread src/passes/CodeFolding.cpp Outdated
Comment thread src/passes/CodeFolding.cpp Outdated
```cpp
// efficient bottom-up traversal.
bool hasExitingBranches(Expression* expr) {
  if (!exitingBranchCachePopulated_) {
    populateExitingBranchCache(getFunction()->body);
```
Member


Looks like this still scans the entire function. I suggest that we only scan expr itself. That will still avoid re-computing things, but avoid scanning things that we never need to look at.

This does require that the cache store a bool, so we know if we scanned or not, and if we did, if we found branches out or not. But I think that is worth it - usually we will scan very few things.

Contributor Author


The per-expression cache would still be O(N^2) in the nested block case. AssemblyScript GC emits `__visit_members` with deeply nested blocks + `br_table`, where the nesting level equals the number of classes (4000+ in real apps). Each nested block gets queried by `optimizeTerminatingTails`, and each query walks its overlapping subtree independently, giving O(N + (N-1) + ... + 1) = O(N^2) total work even with the cache.

We also cannot reuse a child's cached bool to compute a parent's result, because knowing "child has exiting branches" does not tell us which names exit -- the parent may define/resolve some of them. To compose results bottom-up, we would need to store the full set of unresolved names per expression. I benchmarked that approach (storing `unordered_map<Expression*, unordered_set>` and propagating name sets upward), but the per-node set allocation overhead on millions of nodes made -Oz significantly slower than the baseline (~13min vs ~5min).

The whole-function scan avoids both issues by computing all results in a single O(N) pass using only integer counters, with no per-node name storage.

Changqing-JING requested a review from kripken April 16, 2026 01:58